DACSS 601: Data Science Fundamentals - FALL 2022

Caitlin Rowley - HW3

Author

Caitlin Rowley

Published

November 25, 2022

#| label: setup
#| warning: false
#| message: false

# install packages and load libraries:

# Calling install.packages() while rendering fails unless a CRAN mirror is set,
# so install only what is missing and name the mirror explicitly.
for (pkg in c("readr", "readxl")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(readr)
library(readxl)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in Data

I selected a data set from the Inside AirBnB website, capturing data related to summary information and metrics for listings in Boston, MA.

# read in dataset:

# Use a path relative to the post rather than setwd() with a personal
# absolute path, which fails on any other machine. The CSV must sit in
# the same folder as this document.
Boston <- read_csv("Boston AirBnB Data.csv")
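
One way to keep the read step portable is to build the path explicitly and check that the file exists before reading it. The sketch below is only an illustration: it writes a tiny, made-up CSV to a temporary folder as a stand-in for the real download, and the names `csv_path` and `Boston_toy` are hypothetical.

```r
library(readr)

# Hypothetical stand-in for the real download: a tiny CSV in a temporary folder
csv_path <- file.path(tempdir(), "Boston AirBnB Data.csv")
write_csv(data.frame(id = 1:3, price = c(120, 85, 200)), csv_path)

# Stop with a readable message if the file is missing, instead of calling setwd()
if (!file.exists(csv_path)) {
  stop("CSV not found at: ", csv_path)
}

Boston_toy <- read_csv(csv_path, show_col_types = FALSE)
nrow(Boston_toy)
```

In the actual post, `csv_path` would simply be the file name relative to the folder that holds the document.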

Tidy Data

I will now tidy the data to look for missing values and duplicates. I will also rename columns as needed.

# look for duplicates
# look for missing values
# remember na.rm=TRUE for calculations

# At first glance, it seems as though there are no values in the column titled "neighbourhood_group." So, I will find all unique values within that column to determine whether it can be removed from my tidy data set.

unique(Boston["neighbourhood_group"])

# I now know that there is no data within this column. I will remove it from my data set.

Boston_tidy <- subset(Boston, select = -c(neighbourhood_group))

# I can see from viewing this data frame that no other column is entirely empty, so I will move on to other tidying tasks.

# rename columns:

names(Boston_tidy) <- c('room_id', 'room_name', 'host_id', 'host_name', 'neighborhood', 'room_latitude', 'room_longitude', 'room_type', 'room_price', 'min_nights', 'number_reviews', 'last_review', 'reviews_per_month', 'host_listings', 'availability_next_365', 'number_reviews_LTM', 'room_license')

# find duplicates:

duplicates <- duplicated(Boston_tidy)

# rather than printing the whole logical vector, check whether any row is flagged and how many:

any(duplicates)
sum(duplicates)
head(Boston_tidy)
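
The tidying checks described above (empty columns, missing values, duplicated rows) can be sketched on a toy data frame. The values and the names `toy`, `na_counts`, `empty_cols`, `n_dupes`, and `toy_tidy` are made up for illustration and are not taken from the Airbnb file.

```r
# Toy data frame (hypothetical values) with one duplicated row and one empty column
toy <- data.frame(
  room_id = c(1, 2, 2, 3),
  neighbourhood_group = c(NA, NA, NA, NA),
  price = c(100, 150, 150, NA)
)

# Count of missing values per column
na_counts <- colSums(is.na(toy))

# Columns that contain no data at all
empty_cols <- names(toy)[na_counts == nrow(toy)]

# Number of fully duplicated rows
n_dupes <- sum(duplicated(toy))

# Drop the empty columns and the duplicated rows
toy_tidy <- toy[!duplicated(toy), setdiff(names(toy), empty_cols)]
```

Counting with `sum()` and `colSums()` avoids printing long logical vectors, so there is no need to raise `max.print`.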

Summary of Data

“Boston_tidy” represents Airbnb rental listing data for the city of Boston over the last twelve months. The data frame has 17 variables and 5,185 rows. Each row represents one unique observation (in this case, a unique rental listing) described by the following variables: (1) room/listing ID number, (2) name of the room/listing, (3) listing host ID number, (4) listing host name, (5) room/listing neighborhood, (6) room/listing latitude, (7) room/listing longitude, (8) type of room/listing, (9) room/listing price, (10) minimum number of nights for rent, (11) number of room/listing reviews, (12) date of the most recent room/listing review, (13) number of room/listing reviews per month, (14) number of listings by the same host (i.e., the number of unique room listings per host), (15) room/listing availability over the next year, (16) number of reviews for the room/listing over the past 12 months, and (17) room/listing licensure status.

Some potential research questions include:

  • Which neighborhoods have the most expensive price per bed?

  • Which neighborhoods have the highest number of listings?

  • Which property types are listed most frequently?

  • What’s the average listing price by property type?

  • What’s the average price per bed by property type?

  • What does the occupancy rate look like by neighborhood?

  • What factors affect the price the most?

  • Can the price be accurately predicted, given other information about the listing?

  • Is there a correlation between number of reviews and occupancy?

Mutate

Next, I will mutate variables. I will start by adding the variable “room_coordinates” to my data set. This may come in handy if I use a map for visualization, as I may need to match coordinates between my data set and those returned by mapping functions such as ‘map_data()’.

# mutate lat and lon to create "room_coordinates"
# keep lat and lon columns for now

Boston_mutate <- Boston_tidy %>%
  mutate(room_coordinates = paste(room_latitude, room_longitude))

colnames(Boston_mutate)

Summary Statistics

I will next create a summary data set containing the new variable “median_price,” computed only from room prices greater than $0. This data set will be grouped by the original variables “room_type” and “neighborhood,” which will be useful for both summary statistics and visualization.

# find median room prices by neighborhood and room type:

Boston_median <- Boston_mutate %>%
  filter(room_price > 0) %>%
  group_by(room_type, neighborhood) %>%
  summarize(median_price = median(room_price), .groups = "drop")

print(head(Boston_median))
summary(Boston_median$median_price)

We can see from this data set that the highest median room price is $750/night for a shared room in the Fenway neighborhood. The lowest median room price is $10/night for a shared room in Charlestown.
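
To read the highest and lowest group medians directly rather than scanning the table, something like the following could work. This is a sketch on made-up prices; the objects `toy`, `toy_median`, `highest`, and `lowest` are hypothetical and the numbers are not the real Boston values.

```r
library(dplyr)

# Toy listings (hypothetical prices)
toy <- tibble(
  room_type = c("Shared room", "Shared room", "Private room", "Private room", "Private room"),
  neighborhood = c("Fenway", "Fenway", "Fenway", "Allston", "Allston"),
  room_price = c(700, 800, 90, 60, 80)
)

# Median price per room type and neighborhood, ignoring prices of 0
toy_median <- toy %>%
  filter(room_price > 0) %>%
  group_by(room_type, neighborhood) %>%
  summarize(median_price = median(room_price), .groups = "drop")

# Rows with the highest and lowest median price
highest <- slice_max(toy_median, median_price, n = 1)
lowest <- slice_min(toy_median, median_price, n = 1)
```

Here `highest` and `lowest` each keep the room type and neighborhood alongside the median, so the combination behind each extreme is visible at a glance.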

I will next generate statistics using raw data (not excluding outliers or values equal to 0) related to room price, minimum number of nights per stay, number of reviews per room, and number of listings by host.

# summary statistics for entire data set:

summary(Boston_mutate)

# summary statistics for particular group of variables:

Boston_mutate %>%
  select(room_price, min_nights, number_reviews, host_listings, availability_next_365) %>%
  summary()

This raw data indicates that room prices for Boston Airbnbs range from $0 to $10,000 per night (I would assume both extremes are outliers), with a median price of $179 per night and an average of $231 per night. The minimum number of nights per stay ranges from 1 to 730. At first, I assumed the maximum value was an outlier, but Airbnb does offer long-term stays (“long-term” being defined as more than 28 days), so it is possible that this particular listing is for long-term stays only. I will dive deeper into this later to see whether this is common within the data set or truly an outlier. I also included the number of reviews per listing to see whether it may indicate the popularity of certain rooms and, by extension, certain hosts; I will delve into this in my final project as well. In the same vein, I included the number of listings per host, which ranges from 1 to 477. Measured across listings, the median host listing count is 6, while the average is 62. The final component of this analysis is listing availability over the next 365 days, which ranges from 0 to 365 days, with a median of 187 days and an average of 190 days.
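
To check whether long minimum stays are common or rare, the share of listings above the 28-night threshold can be computed directly. The sketch below uses made-up values; `toy` and `long_term` are hypothetical names, and the 28-night cutoff is the definition quoted above.

```r
library(dplyr)

# Toy minimum-night values (hypothetical)
toy <- tibble(min_nights = c(1, 2, 29, 30, 91, 730, 3, 28))

# Count and share of listings that require more than 28 nights
long_term <- toy %>%
  summarize(
    n_listings = n(),
    n_long_term = sum(min_nights > 28),
    share_long_term = mean(min_nights > 28)
  )
```

A large share would suggest that long minimum stays are a feature of the market rather than a data error.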

As a precursor to a deeper analysis on number of host listings, I filtered the data set to include only values greater than one in the “host_listings” column, which tells us the number of rooms listed by the same host.

# filter by hosts with more than one listing:

Boston_id <- Boston_mutate %>%
  filter(host_listings > 1)

# number of listings whose host has more than one listing:
nrow(Boston_id)
head(Boston_id)

This output indicates that there are 3,918 listings whose hosts have more than one listing in Boston’s Airbnb data.

Additionally, we can dig a little deeper into the number of room-specific listings by host.

max_listings <- Boston_mutate %>%
  select(host_id, host_name, room_name, neighborhood, host_listings)

# keep only listings whose host has 477 listings (compare against the number, not the string '477'):
max_listings[max_listings$host_listings == 477, ]

head(max_listings)

We now know that 477 listings belong to a host with 477 listings, which suggests that a single host accounts for all of them.
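
A quick way to confirm how many distinct hosts sit behind those rows is to count rows per host. This is a sketch on toy data; `toy`, `host_counts`, `top_host`, and the column name `rows_in_data` are hypothetical.

```r
library(dplyr)

# Toy listings (hypothetical): host 11 has three listings, host 33 has two
toy <- tibble(
  host_id = c(11, 11, 11, 22, 33, 33),
  host_listings = c(3, 3, 3, 1, 2, 2)
)

# One row per host, with the number of rows that host contributes
host_counts <- toy %>%
  count(host_id, host_listings, name = "rows_in_data", sort = TRUE)

# Host with the most rows in the data
top_host <- slice_max(host_counts, rows_in_data, n = 1)
```

Comparing `rows_in_data` with `host_listings` also shows whether the reported listing count matches what is actually in the file.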

I will next generate some summary statistics for the categorical variable indicating listing neighborhood.

# summary statistics by neighborhood:

unique_neighbor <- unique(Boston_mutate$neighborhood)

# most frequent neighborhood (index into unique_neighbor, not the unique() function):
unique_neighbor[which.max(tabulate(match(Boston_mutate$neighborhood, unique_neighbor)))]

We can see from this tabulation that the most frequently listed neighborhood is Allston.
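
A frequency table gives the same answer and also shows the counts behind it. The sketch below uses a made-up vector; `neighborhoods`, `freq`, and `most_common` are hypothetical names.

```r
# Toy neighborhood column (hypothetical values)
neighborhoods <- c("Allston", "Fenway", "Allston", "Roxbury", "Allston", "Fenway")

# Listings per neighborhood, most frequent first
freq <- sort(table(neighborhoods), decreasing = TRUE)

# Name of the most frequent neighborhood
most_common <- names(freq)[1]
```

Printing `freq` makes it easy to see how far ahead the top neighborhood is, which the single-value result does not show.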

Visualization with Multiple Dimensions

I will next focus on data visualization. I will first generate a bar chart portraying median room price by neighborhood. This visual does not exclude outliers, though it will exclude room prices that equal zero.

library(RColorBrewer)
library(ggtext)  # provides element_markdown(); the package must be installed beforehand
library(ggplot2)

# group median price by neighborhood:

Boston_median_price <- Boston_mutate %>%
  filter(room_price > 0) %>%
  group_by(neighborhood) %>%
  summarize(median_price = median(room_price))

# generate bar chart:

ggplot(Boston_median_price, aes(x = neighborhood, y = median_price, fill = neighborhood)) +
  geom_col() +
  scale_fill_hue() +
  theme_classic() +
  labs(x = "Neighborhood", y = "Median Price per Night", title = "Boston Airbnb Rental Prices by Neighborhood") +
  theme(axis.text.x = element_markdown(angle = 90, hjust = 1))

This bar chart tells us that the neighborhoods with the highest median room price per night are (1) Chinatown at just under $400/night, (2) Back Bay at just under $300/night, and (3) Downtown at about $260/night. The neighborhoods with the lowest median prices are (1) Roxbury at about $80/night, (2) Dorchester at just under $100/night, and (3) Hyde Park at about $100/night. These values are confirmed in the data frame “Boston_median_price.”
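
Rankings like these are easier to read off the chart when the bars are sorted by value. A minimal sketch using `reorder()` is shown here; the three prices are rough stand-ins chosen for illustration, not values taken from the data, and `toy` and `p` are hypothetical names.

```r
library(ggplot2)

# Toy medians (hypothetical stand-in values)
toy <- data.frame(
  neighborhood = c("Roxbury", "Chinatown", "Back Bay"),
  median_price = c(80, 390, 295)
)

# reorder() sorts the x-axis by descending median price
p <- ggplot(toy, aes(x = reorder(neighborhood, -median_price), y = median_price)) +
  geom_col() +
  labs(x = "Neighborhood", y = "Median Price per Night") +
  theme_classic()

p
```

With sorted bars, the highest and lowest neighborhoods sit at the two ends of the axis.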

I will next generate a scatter plot (geom_point) to visualize room price by neighborhood, using facet wrapping to separate the values by room type. I will also apply a boxplot overlay to capture both the interquartile range and outliers. I will first need to exclude strong outliers.
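
To make the outlier rule concrete before applying it to the listings, here is a small worked example of the 1.5 × IQR fences used in the code that follows. The prices are made up, and `prices`, `lower`, `upper`, and `flagged` are hypothetical names.

```r
# Toy nightly prices (hypothetical), with one extreme value
prices <- c(80, 95, 100, 110, 120, 130, 150, 1000)

q1 <- quantile(prices, 0.25, names = FALSE)   # first quartile: 98.75
q3 <- quantile(prices, 0.75, names = FALSE)   # third quartile: 135
iqr <- q3 - q1                                # interquartile range: 36.25

# Fences: anything outside [lower, upper] is treated as an outlier
lower <- q1 - 1.5 * iqr                       # 44.375
upper <- q3 + 1.5 * iqr                       # 189.375

flagged <- prices[prices < lower | prices > upper]
flagged
```

Only the 1000 value falls outside the fences here, so it is the single price that the rule would drop.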

# remove outliers:

is_outlier <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
}

Boston_outlier <- Boston_mutate %>%
  filter(!is_outlier(room_price))

# create dataframe (assign the result so the price filter is actually kept):

Boston_outlier <- Boston_outlier %>%
  filter(room_price > 0, room_price < 800)

# generate geom_point chart
# facet wrap
# boxplot overlay

Boston_outlier %>%
  ggplot(aes(x = neighborhood, y = room_price)) +
  geom_point(alpha = .08, size = 3, color = "light pink") +
  facet_wrap(~ room_type) +
  labs(x = "Neighborhood", y = "Price per Night", title = "Boston Airbnb Rental Prices by Neighborhood and Room Type") +
  theme_light() +
  geom_boxplot() +
  theme(axis.text.x = element_markdown(angle = 90, hjust = 1))

This is very difficult to read due to the number of neighborhoods, so I am going to apply the three variables (room price, room type, and neighborhood) to another visual.

In the interim, I will display a simpler version of this geom_point chart without the facet wrap so that the visual only captures neighborhood and room price per night.

# generate geom_point chart with boxplot:

Boston_outlier %>%
  ggplot(aes(x = neighborhood, y = room_price)) +
  geom_point(alpha = .08, size = 5, color = "light pink") +
  labs(x = "Neighborhood", y = "Price per Night", title = "Boston Airbnb Rental Prices by Neighborhood") +
  theme_light() +
  geom_boxplot() +
  theme(axis.text.x = element_markdown(angle = 90, hjust = 1))

# want to add values: text(x = Boston_outlier$room_price, y = Boston_outlier$room_price, labels = Boston_outlier$room_price)

Here, we can see the distribution of prices across neighborhoods using individual data points. The boxplot overlay summarizes the spread: the box spans the 25th to 75th percentiles with the median marked inside, the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn as outliers. With this visualization, we can see that the neighborhoods with the narrowest distribution of room prices are Chinatown and the Leather District, while the neighborhoods with the broadest distribution seem to be Charlestown, Harbor Islands, and Mattapan. In my final project, I’d like to add value labels to the boxplots to confirm this.
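
Until the labels are on the plot, the same comparison can be made numerically by computing the boxplot statistics per neighborhood. This is a sketch on made-up prices; `toy` and `price_summary` are hypothetical names, and the two neighborhoods are used only as labels.

```r
library(dplyr)

# Toy prices (hypothetical): one tight and one wide distribution
toy <- tibble(
  neighborhood = rep(c("Chinatown", "Mattapan"), each = 5),
  room_price = c(180, 190, 200, 210, 220, 50, 100, 150, 300, 400)
)

# Five-number summary plus IQR for each neighborhood
price_summary <- toy %>%
  group_by(neighborhood) %>%
  summarize(
    min = min(room_price),
    q1 = quantile(room_price, 0.25, names = FALSE),
    median = median(room_price),
    q3 = quantile(room_price, 0.75, names = FALSE),
    max = max(room_price),
    iqr = IQR(room_price),
    .groups = "drop"
  ) %>%
  arrange(iqr)
```

Sorting by `iqr` puts the narrowest distributions first and the broadest last, which is the comparison the boxplots are meant to show.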

I will next visualize the data using a choropleth map. I will generate a map of the Boston area and apply data related to neighborhood, room price, and room type.

library(maps)
library(viridisLite)
library(ggplot2)
library(tidyverse)

# generate map

states_map <- map_data("state")
head(states_map)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>
4 -87.53076 30.33239     1     4 alabama      <NA>
5 -87.57087 30.32665     1     5 alabama      <NA>
6 -87.58806 30.32665     1     6 alabama      <NA>
ma_map <- filter(states_map, region=="massachusetts") %>%
ggplot(., aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black")
print(ma_map)

I’ve generated the map of Massachusetts, so now I will work on merging my data sets to apply as an overlay to the map.

# merge 'ma_map' and 'Boston_tidy' by coordinates

# mutate and rename columns

Boston_coord <- Boston_mutate %>%
  rename(coordinates = room_coordinates)

head(Boston_coord)
ma_map_df <- filter(states_map, region=="massachusetts")
ma_mutate <- ma_map_df %>%
  mutate("coordinates" = paste(lat, long))
head(ma_mutate)
       long      lat group order        region         subregion
1 -70.45089 41.40193    20  5926 massachusetts martha's vineyard
2 -70.45662 41.39047    20  5927 massachusetts martha's vineyard
3 -70.45662 41.37328    20  5928 massachusetts martha's vineyard
4 -70.46808 41.35609    20  5929 massachusetts martha's vineyard
5 -70.50819 41.35609    20  5930 massachusetts martha's vineyard
6 -70.56548 41.34464    20  5931 massachusetts martha's vineyard
                         coordinates
1  41.401927947998 -70.4508895874023
2 41.3904724121094 -70.4566192626953
3 41.3732833862305 -70.4566192626953
4 41.3560943603516 -70.4680786132812
5  41.3560943603516 -70.508186340332
6 41.3446350097656 -70.5654830932617
# merge data
# note: this joins on exact coordinate strings, so rows only match where a
# listing sits exactly on a vertex of the state outline

map_merge <- merge(ma_mutate, Boston_coord, by = "coordinates", all = TRUE)
head(map_merge)

# remove if room_id is NA when merged with map data

map_merged <- map_merge %>% filter(!is.na(room_id))
head(map_merged)

I now have my map and my merged data, but I’d like to zoom in on Boston, as the data would be illegible if the map is kept at its current scale. I will try an iteration of the above code to generate this visual.

# load map, plot data

ma_outline <- filter(states_map, region == "massachusetts")

# draw the state outline, then overlay the listings
# (longitude belongs on the x-axis and latitude on the y-axis)
ggplot() +
  geom_polygon(data = ma_outline, aes(x = long, y = lat, group = group), colour = "black", fill = NA) +
  geom_point(data = Boston_mutate, aes(x = room_longitude, y = room_latitude, size = room_price, color = neighborhood)) +
  coord_map()

The axis ranges on this visual are not yet right for Boston, and there are too many data points included. For my final project, I will work on adjusting the axis limits to zoom in on the city, and I will also consider ways to slice or group the data to make the visual more readable. I will also consider additional variables that I can visualize.
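
As a starting point for the zoom, the axis limits can be set through the coordinate system rather than by filtering the data. The sketch below is an assumption-heavy illustration: the bounding box is a rough guess at Boston’s extent rather than an official boundary, the three listings are made up, and `toy_listings`, `boston_xlim`, `boston_ylim`, and `p` are hypothetical names.

```r
library(ggplot2)

# Toy listings (hypothetical coordinates and prices)
toy_listings <- data.frame(
  room_longitude = c(-71.06, -71.09, -71.13),
  room_latitude = c(42.35, 42.34, 42.36),
  room_price = c(250, 120, 90)
)

# Rough bounding box around Boston (an assumption, not an official boundary)
boston_xlim <- c(-71.20, -70.98)
boston_ylim <- c(42.22, 42.40)

# Longitude on x, latitude on y; coord_quickmap() zooms without dropping points
p <- ggplot(toy_listings, aes(x = room_longitude, y = room_latitude, size = room_price)) +
  geom_point(alpha = 0.5) +
  coord_quickmap(xlim = boston_xlim, ylim = boston_ylim) +
  labs(x = "Longitude", y = "Latitude", size = "Price per Night")

p
```

Setting the limits in `coord_quickmap()` keeps every point in the data while only changing the visible window, and transparency (`alpha`) helps when many listings overlap.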